Improving SMT with Morphology Knowledge for Baltic Languages

Author

  • Raivis SKADIŅŠ
Abstract

In recent years, several machine translation systems have been built for the Baltic languages. Besides the Google and Microsoft machine translation engines and research experiments with statistical MT for Latvian [1] and Lithuanian, there are both English-Latvian [2] and English-Lithuanian [3] rule-based MT systems available. Both Latvian and Lithuanian are morphologically rich languages with fairly free word order. Combined with the limited availability of parallel corpora for these languages, this poses a sparseness problem for phrase-based SMT. This research is part of a project to build the best general-purpose phrase-based SMT system using publicly available and proprietary corpora and tools. During the project we added language-specific knowledge to assess the possible improvement in translation quality. This paper reports on the implementation, as well as the automatic and human evaluation, of English-Latvian and Lithuanian-English statistical machine translation systems. Results of the human evaluation show that integrating morphology knowledge into SMT gives a significant improvement in translation quality compared to baseline SMT.

Similar resources

Improving SMT for Baltic Languages with Factored Models

This paper reports on the implementation and evaluation of English-Latvian and Lithuanian-English statistical machine translation systems. It also gives a brief introduction to the project scope: the Baltic languages, prior implementations of MT, and evaluation of MT systems. In this paper we report on the results of both automatic and human evaluation. Results of human evaluation show that factored SMT gives si...

SMT of Latvian, Lithuanian and Estonian Languages: a Comparative Study

This paper is an attempt to discover the main challenges in working with the Baltic and Estonian languages, and to identify the most significant sources of errors generated by an SMT system trained on large-vocabulary parallel corpora from the legislative domain. An immense distinction between the Latvian/Lithuanian and Estonian languages causes a set of non-equivalent difficulties which we classify and com...

Real-world challenges in application of MT for localization: the Baltic case

In this paper we share our experience from implementing machine translation in localization into the relatively small languages of the three Baltic countries: Latvian, Lithuanian, and Estonian. We describe our approach to improving terminology translation and consistency by preprocessing the source text and performing term integration. We present results of a formal evaluation of MT impact on t...

Modelling Linguistic Phenomena with Unsupervised Morphology for Improving Statistical Machine Translation

This work studies an ascetic approach to statistical machine translation. We assume that only a small parallel corpus is available, and no other mono- or bilingual corpora or linguistic tools can be used, which is the case for many resource-scarce languages. Our aim is to find out how a baseline SMT system can be improved under this condition. In such a case one of the natural choices is to use u...

Using POS Information for SMT into Morphologically Rich Languages

When translating from languages with hardly any inflectional morphology, like English, into morphologically rich languages, the English word forms often do not contain enough information for producing the correct fullform in the target language. We investigate methods for improving the quality of such translations by making use of part-of-speech information and maximum entropy modeling. Results fo...

Publication date: 2010